137 research outputs found
Recommended from our members
Large-scale social-media analytics on stratosphere
The importance of social-media platforms and online communities - in business as well as public context - is more and more acknowledged and appreciated by industry and researchers alike. Consequently, a wide range of analytics has been proposed to understand, steer, and exploit the mechanics and laws driving their functionality and creating the resulting benefits. However, analysts usually face significant problems in scaling existing and novel approaches to match the data volume and size of modern online communities. In this work, we propose and demonstrate the usage of the massively parallel data processing system Stratosphere, based on second order functions as an extended notion of the MapReduce paradigm, to provide a new level of scalability to such social-media analytics. Based on the popular example of role analysis, we present and illustrate how this massively parallel approach can be leveraged to scale out complex data-mining tasks, while providing a programming approach that eases the formulation of complete analytical workflows
Dynamic Parameter Allocation in Parameter Servers
To keep up with increasing dataset sizes and model complexity, distributed
training has become a necessity for large machine learning tasks. Parameter
servers ease the implementation of distributed parameter management---a key
concern in distributed training---, but can induce severe communication
overhead. To reduce communication overhead, distributed machine learning
algorithms use techniques to increase parameter access locality (PAL),
achieving up to linear speed-ups. We found that existing parameter servers
provide only limited support for PAL techniques, however, and therefore prevent
efficient training. In this paper, we explore whether and to what extent PAL
techniques can be supported, and whether such support is beneficial. We propose
to integrate dynamic parameter allocation into parameter servers, describe an
efficient implementation of such a parameter server called Lapse, and
experimentally compare its performance to existing parameter servers across a
number of machine learning tasks. We found that Lapse provides near-linear
scaling and can be orders of magnitude faster than existing parameter servers
Benchmarking Distributed Stream Data Processing Systems
The need for scalable and efficient stream analysis has led to the
development of many open-source streaming data processing systems (SDPSs) with
highly diverging capabilities and performance characteristics. While first
initiatives try to compare the systems for simple workloads, there is a clear
gap of detailed analyses of the systems' performance characteristics. In this
paper, we propose a framework for benchmarking distributed stream processing
engines. We use our suite to evaluate the performance of three widely used
SDPSs in detail, namely Apache Storm, Apache Spark, and Apache Flink. Our
evaluation focuses in particular on measuring the throughput and latency of
windowed operations, which are the basic type of operations in stream
analytics. For this benchmark, we design workloads based on real-life,
industrial use-cases inspired by the online gaming industry. The contribution
of our work is threefold. First, we give a definition of latency and throughput
for stateful operators. Second, we carefully separate the system under test and
driver, in order to correctly represent the open world model of typical stream
processing deployments and can, therefore, measure system performance under
realistic conditions. Third, we build the first benchmarking framework to
define and test the sustainable performance of streaming systems.
Our detailed evaluation highlights the individual characteristics and
use-cases of each system.Comment: Published at ICDE 201
A Survey on Transactional Stream Processing
Transactional stream processing (TSP) strives to create a cohesive model that
merges the advantages of both transactional and stream-oriented guarantees.
Over the past decade, numerous endeavors have contributed to the evolution of
TSP solutions, uncovering similarities and distinctions among them. Despite
these advances, a universally accepted standard approach for integrating
transactional functionality with stream processing remains to be established.
Existing TSP solutions predominantly concentrate on specific application
characteristics and involve complex design trade-offs. This survey intends to
introduce TSP and present our perspective on its future progression. Our
primary goals are twofold: to provide insights into the diverse TSP
requirements and methodologies, and to inspire the design and development of
groundbreaking TSP systems
08421 Abstracts Collection -- Uncertainty Management in Information Systems
From October 12 to 17, 2008 the Dagstuhl Seminar 08421 \u27`Uncertainty Management in Information Systems \u27\u27 was held in Schloss Dagstuhl~--~Leibniz Center for Informatics. The abstracts of the plenary and session talks given during the seminar as well as those of the shown demos are put together in this paper
Bigearthnet: A Large-Scale Benchmark Archive for Remote Sensing Image Understanding
© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.This paper presents the BigEarthNet that is a new large-scale multi-label Sentinel-2 benchmark archive. The BigEarthNet consists of 590, 326 Sentinel-2 image patches, each of which is a section of i) 120 × 120 pixels for 10m bands; ii) 60×60 pixels for 20m bands; and iii) 20×20 pixels for 60m bands. Unlike most of the existing archives, each image patch is annotated by multiple land-cover classes (i.e., multi-labels) that are provided from the CORINE Land Cover database of the year 2018 (CLC 2018). The BigEarthNet is significantly larger than the existing archives in remote sensing (RS) and thus is much more convenient to be used as a training source in the context of deep learning. This paper first addresses the limitations of the existing archives and then describes the properties of the BigEarthNet. Experimental results obtained in the framework of RS image scene classification problems show that a shallow Convolutional Neural Network (CNN) architecture trained on the BigEarthNet provides much higher accuracy compared to a state-of-the-art CNN model pre-trained on the ImageNet (which is a very popular large-scale benchmark archive in computer vision). The BigEarthNet opens up promising directions to advance operational RS applications and research in massive Sentinel-2 image archives.EC/H2020/759764/EU/Accurate and Scalable Processing of Big Data in Earth Observation/BigEarthBMBF, 01IS14013A, Verbundprojekt: BBDC - Berliner Kompetenzzentrum für Big Dat
10381 Summary and Abstracts Collection -- Robust Query Processing
Dagstuhl seminar 10381 on robust query processing (held 19.09.10 -
24.09.10) brought together a diverse set of researchers and practitioners
with a broad range of expertise for the purpose of fostering discussion
and collaboration regarding causes, opportunities, and solutions for
achieving robust query processing.
The seminar strove to build a unified view across
the loosely-coupled system components responsible for
the various stages of database query processing.
Participants were chosen for their experience with database
query processing and, where possible, their prior work in academic
research or in product development towards robustness in database query
processing.
In order to pave the way to motivate, measure, and protect future advances
in robust query processing, seminar 10381 focused on developing tests
for measuring the robustness of query processing.
In these proceedings, we first review the seminar topics, goals,
and results, then present abstracts or notes of some of the seminar break-out
sessions.
We also include, as an appendix,
the robust query processing reading list that
was collected and distributed to participants before the seminar began,
as well as summaries of a few of those papers that were
contributed by some participants
Spinning Fast Iterative Data Flows
Parallel dataflow systems are a central part of most analytic pipelines for
big data. The iterative nature of many analysis and machine learning
algorithms, however, is still a challenge for current systems. While certain
types of bulk iterative algorithms are supported by novel dataflow frameworks,
these systems cannot exploit computational dependencies present in many
algorithms, such as graph algorithms. As a result, these algorithms are
inefficiently executed and have led to specialized systems based on other
paradigms, such as message passing or shared memory. We propose a method to
integrate incremental iterations, a form of workset iterations, with parallel
dataflows. After showing how to integrate bulk iterations into a dataflow
system and its optimizer, we present an extension to the programming model for
incremental iterations. The extension alleviates for the lack of mutable state
in dataflows and allows for exploiting the sparse computational dependencies
inherent in many iterative algorithms. The evaluation of a prototypical
implementation shows that those aspects lead to up to two orders of magnitude
speedup in algorithm runtime, when exploited. In our experiments, the improved
dataflow system is highly competitive with specialized systems while
maintaining a transparent and unified dataflow abstraction.Comment: VLDB201
- …